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Summary. Although there is a rich literature on methods for allowing the variance in a univari- 
ate regression model to vary with predictors, time and other factors, relatively little has been 
done in the multivariate case. Our focus is on developing a class of nonparametric covariance 
regression models, which allow an unknown p x p covariance matrix to change flexibly with 
predictors. The proposed modeling framework induces a prior on a collection of covariance 
matrices indexed by predictors through priors for predictor-dependent loadings matrices in a 
factor model. In particular, the predictor-dependent loadings are characterized as a sparse 
CN ■ combination of a collection of unknown dictionary functions (e.g, Gaussian process random 

functions). The induced covariance is then a regularized quadratic function of these dictionary 
elements. Our proposed framework leads to a highly-flexible, but computationally tractable 
formulation with simple conjugate posterior updates that can readily handle missing data. The- 
oretical properties are discussed and the methods are illustrated through simulations studies 
and an application to the Google Flu Trends data. 

Keywords: covariance estimation; Gaussian process; heteroscedastic regression; 
nonparametric Bayes; stochastic process. 

1 . Introduction 

Spurred by the increasing prevalence of high-dimensional datasets and the computational 
capacity to analyze them, capturing hcteroscedasticity in multivariate processes has become 
a growing focus in many applied domains. For example, within the field of financial time 
series modeling, capturing the time- varying volatility and co-volatility of a collection of risky 
assets is key in devising a portfolio management scheme. Likewise, the spatial statistics 
community is often faced with multivariate measurements (e.g., temperature, precipitation, 
etc.) recorded at a large collection of locations, necessitating methodology to model the 
strong spatial (and spatio-temporal) variations in correlations. More generally, imagine 
that one has some arbitrary, potentially multivariate predictor space X and a collection of 
multivariate response vectors y. The problem of mean regression (i.e., fJ.(x) = E(y \ x)) has 
been well studied in both the univariate and multivariate settings. Although there is a rich 
literature on methods for allowing the variance in a univariate regression model to vary with 
predictors, there is a dearth of methodologies for the general case of multivariate covariance 
regression (i.e., = cov(y | x)). The covariance matrix captures key correlations between 
the elements of the response vector, and the typical assumption of a homoscedastic model 
can have significant impact on inferences. 

Historically, the problem of multivariate covariance regression has typically been ad- 
dressed by standard regression operations on the unconstrained elements of the log or 
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Cholesky decomposition of the covariance (or precision) matrix. For example, Pourahmadi 
[1999] proposes to model elements of chol(£(:r) _1 ) as a linear function of the predictors. 
The weights associated with the ith row have a nice interpretation in terms of the con- 
ditional distribution of yi given y\, 2/2, • • • , y%—i] however, the model is not invariant to 
permutations of the elements of y which is problematic in applications where there does not 
exist a natural ordering. Alternatively, Chiu et al. [1996] consider modeling each element 
of log(£(a;)) as a linear function of the predictor. An issue with this formulation is in the 
interpretability of the model: a submatrix of £(cc) does not necessarily coincide with a sub- 
matrix of the matrix logarithm. Additionally, both the models of Pourahmadi [1999] and 
Chiu et al. [1996] involve a large number of parameters (specifically, dxp(p+l)/2 assuming 
x £ 3? rf .) More recently, Hoff and Niu [2010] propose a covariance regression model in which 
= A + Bxx'B' with A positive definite and B real. This model has interpretable 
parameters and may be equated with a latent factor model, leading to computational ad- 
vantages. However, the model still has key limitations in (i) scaling to large p domains, 
and (ii) flexibility based on the parametric approach. Specifically, the model restricts the 
difference between S(x) and the baseline matrix A to be rank 1. Higher rank models can 
be considered via extensions such as £(x) = A + Bxx'B' + Cxx'C, but this dramatically 
increases the parameterization and requires definition of the maximal rank difference. 

For volatility modeling where the covariate space is typically taken to be discrete time, 
heteroscedasticity has classically been captured via either variants of ARCH [Englc, 2002] or 
stochastic volatility models [Harvey et al., 1994]. The former directly specifies the volatility 
matrix as a linear combination of lagged volatilities and squared returns, which suffers from 
curse of dimensionality, and is typically limited to datasets with 5 or fewer dimensions. 
Alternatively, multivariate stochastic volatility models assume £(£) = AT(t)A', with A real, 
T(t) = diag(exp hit, ■ ■ ■ , exp h p t), and hu independent autoregressive processes. See Chib 
et al. [2009] for a survey of such approaches. More recently, a number of papers have 
examined inducing covariance processes through variants of a Wishart process. Philipov 
and Glickman [2006a] take E(i)" 1 | £(i - 1) - W(n,S t -i) with S t -i = l/n(AV 2 )(£(f _ 
l)~ 1 Y{A 1 / 2 )'\ . Alternatively, a conditionally non-central Wishart distribution is induced 
on the precision matrix in Gourieroux et al. [2009] by taking £(i) = J2^=i x ktx' kt , with 
each Xk independently a first order Gaussian autoregressive process. Key limitations of 
these types of Wishart processes are that posterior computations are extremely challenging, 
theory is lacking (e.g., simple descriptions of marginal distributions), and single parameters 
(e.g., n and v) control the inter- and intra-temporal covariance relationships. Prado and 
West [2010] review alternative models of time-varying covariance matrices for dynamic 
linear models via discounting methods that maintain conjugacy. Central to all of the cited 
volatility models is the assumption of Markov dynamics, limiting the ability to capture long- 
range dependencies and often leading to spiky trajectories. Additionally, these methods 
assume a regular grid of observations that cannot easily accommodate missing values. 

Within the spatial statistics community, the term Wishart process is typically used to 
specify a different formulation than those described herein for volatility modeling. Specif- 
ically, letting £(s) denote the covariance of a p-dimensional observation at geographic lo- 
cation s € K 2 , Gclfand et al. [2004] assume that £(s) = A(s)A(s)' + So with So diagonal 
and T(s) = A(s)A(s)' following a matric-variate Wishart process. This Wishart process 
is such that T(s) -1 = 0£(s)£(s)'0' with £^ ~ GP(0,c,) independently for each l,j and 

fExtending to higher dimensions, Philipov and Glickman [2006b] apply this model to the covari- 
ance of a lower-dimensional latent factor in a standard latent factor model. 
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typically taken to be diagonal. The induced distribution on T(s) is then marginally in- 
verse WishartJ. Posterior computations in this model rely on Metropolis- Hastings proposals 
that do not scale well to dimensions p larger than 2-3 and cannot naturally accommodate 
missing data. In terms of spatio-temporal processes, Lopes ct al. [2008] build upon a stan- 
dard dynamic factor model to develop nonseparable and nonstationary space-time models. 
Specifically, the vector y t of univariate observations y s t at spatial locations s is modeled 
as yt = fa + (3 ft + e* with the components of the latent factors f t independently evolving 
according to a first-order autoregressive process and columns of the factor loadings matrix 
(3 independently drawn from Markov random fields. Key assumptions of this formulation 
are that the observations evolve in discrete time on a regular grid, and that the dynamics of 
the spatio-temporal process are captured by independent random walks on the components 
of the latent factors. 

In this paper, we present a Bayesian nonparametric approach to multivariate covariance 
regression that allows the covariance matrix to change flexibly with predictors and readily 
scales to high-dimensional datasets. The proposed modeling framework induces a prior 
on a collection of covariance matrices = {S(x),x € X} through specification of a 
prior on a predictor-dependent latent factor model. In particular, the predictor-dependent 
loadings are characterized as a sparse combination of a collection of unknown dictionary 
functions (e.g, Gaussian process random functions). The induced covariance is then a 
regularized quadratic function of these dictionary elements. The proposed methodology 
has numerous advantages over previous approaches. By employing collections of continuous 
random functions, we allow for an irregular grid of observations. Similarly, we can easily 
cope with missing data within our framework without relying on imputing the missing 
values. Another fundamental property of the proposed methodology is the fact that our 
combined use of a shrinkage prior with a latent factor model enables us (in theory) to handle 
high-dimensional datasets (e.g., on the order of hundreds of dimensions) in the presence of 
a limited number of observations. Essential in being able to cope with such large datasets in 
practice is the fact that our computations are tractable, based on simple conjugate posterior 
updates. Finally, we are able to state theoretical properties of our proposed prior, such as 
large support. 

The paper is organized as follows. In Section 2, we describe our proposed Bayesian 
nonparametric covariance regression model in addition to analyzing the theoretical prop- 
erties of the model. Section 3 details the Gibbs sampling steps involved in our posterior 
computations. Finally, a number of simulation studies are examined in Section 4, with an 
application to the Google Trends flu dataset presented in Section 5. 

2. Covariance Regression Priors 

2. 1. Notation and Motivation 

Let denote the p x p covariance matrix at "location" x € X. In general, x is an 

arbitrary, possibly multivariate predictor value. In dynamical modeling, x may simply 
represent a discrete time index (i.e., X — {1, . . . , T}) or, in spatial modeling, a geographical 
location (i.e., X = -ft 2 ). Another simple, tractable case is when x represents an ordered 

JMore generally, Gelfand et al. [2004] develop a spatial coregionalization model such that 
cov(j/(si), y{s2)) = p(si — S2)A(si)A(s2)' + £o (i.e., a model with spatial dependencies arising 
in both the covariance and cross covariance). 
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categorical predictor (i.e., X = {1,. . . , iV}). We seek a prior for Y>x = {E(.t),.t <G X}, the 
collection of covariance matrices over the space of predictor values. 

Letting ~ lis our goal is to choose a prior LTs for the collection of covariance 
matrices that has large support and leads to good performance in large p settings. By 
"good" we mean accurate estimation in small samples, taking advantage of shrinkage priors 
and efficient computation that scales well as p increases. We initially focus on the relatively 
simple setting in which 

y t ~ Afp(n(xi),Z(xi)) (1) 

independently for each i. Such a formulation could be extended to settings in which data 
are collected at repeated times for different subjects, as in multivariate longitudinal data 
analysis, by embedding the proposed model within a hierarchical framework. See Section 6 
for a discussion. 



2.2. Proposed Latent Factor Model 

In large p settings, modeling apxp covariance matrix S(a:) over an arbitrary predictor space 
X represents an enormous dimensional regression problem; we aim to reduce dimensional- 
ity for tractability in building a flexible nonparamctric model for the predictor-dependent 
covariances. A popular approach for coping with such high dimensional (non-predictor- 
dependent) covariance matrices S in the presence of limited data is to assume that the 
covariance has a decomposition as AA' + Eo where A is a p x k factor loadings matrix with 
k << p and Eo is a p x p diagonal matrix with non- negative entries. To build in predictor 
dependence, we assume a decomposition 

E(x) = A(x)A(x)' + E , (2) 

where A(x) is a p x k factor loadings matrix that is indexed by predictors x and where 
Eo = diag(crj , . . . , cr^). Assuming initially for simplicity that n(x) = 0, such a decomposition 
is induced by marginalizing out a set of latent factors r\i from the following latent factor 
model: 

Vi = A(xi)r]i + ei 
Vi~Afk(0,h), ei~7V p (0,E ). [ ' 

Here, Xi = (xn, . . . ,Xi q )' is the predictor associated with the ith observation y^. 

Despite the dimensionality reduction introduced by the latent factor model of Eq. (3), 
modeling apxk dimensional predictor-dependent factor loadings matrix A(x) still represents 
a significant challenge for large p domains. To further reduce dimensionality, and following 
the strategy of building a flexible high-dimensional model from simple low-dimensional 
pieces, we assume that each element of A(a;) is a linear combination of a much smaller 
number of unknown dictionary functions £^ : X — > 5R. That is, we propose to let 

A( Xi ) = Qtfa), (4) 

where £ W xL is the matrix of coefficients relating the predictor-dependent factor loadings 
matrix to the set of dictionary functions comprising the L x k dimensional matrix £(x). 
Typically, k « p and L « p. Since we can write 

L 

[A(-)]r. = 5>r/6.(0, (5) 
1=1 
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we see that the weighted sum of the sth column of dictionary functions Cs('), with weights 
specified by the rth row of O, characterizes the impact of the sth latent factor on yi r , the rth 
component of the response at predictor location Xi. By characterizing the elements of A(xj) 
as a linear combination of these flexible dictionary functions, we obtain a highly-flexible 
but computationally tractable formulation. 

In marginalizing out the latent factors, we now obtain the following induced covariance 
structure 



Note that the above decomposition of E(x) is not unique and there are actually infinitely 
many such equivalent decompositions. For example, take Ox = c0 and = (l/c)£(-). 
Alternatively, consider £i(-) = £(-)P for any orthogonal matrix P or 81 = [6 pX d] and 
£x = [£; £0] f° r an Y dxk matrix of dictionary functions £o- One can also increase the dimen- 
sion of the latent factors and take £1 = [£ Olx<i]- In standard (non-prcdictor-dcpcndent) 
latent factor modeling, a common approach to ensure identifiability is to constrain the 
factor loadings matrix to be block lower triangular with strictly positive diagonal ele- 
ments [Geweke and Zhou, 1996], though such a constraint induces order dependence among 
the responses [Aguilar and West, 2000, West, 2003, Lopes and West, 2004, Carvalho et al., 
2008]. However, for tasks such as inference on the covariance matrix and prediction, iden- 
tifiability of a unique decomposition is not necessary. Thus, we do not restrict ourselves 
to a unique decomposition of E(x), allowing us to define priors with better computational 
properties. 

Although we are not interested in identifying a unique decomposition of E(x), we are 
interested in characterizing the class of covariance regressions E(x) that can be decomposed 
as in Eq. (6). Lemma 2.1 states that for L and k sufficiently large, any covariance regression 
has such a decomposition. For L,k > p, let X^ denote the space of all L x k dimensional 
matrices of arbitrarily complex dictionary functions mapping from X — > !ft, X^ be the 
space of all p x p diagonal matrices with no n- negative entries, and Xq be the space of all 
p x L dimensional matrices such that 00' has finite elements. 

Lemma 2.1. Given a symmetric positive semidefinite matrix E(x) >- 0, Va; G X, there 
exists {£(•)) 0, Eq} G Xq <H> X^ such that 



Proof. Assume without loss of generality that T,q = pX p and take k,L>p. Consider 



cov( yi \xi = x) = E(a;) = 6£(x)£(x)'e' + E . 



(6) 



s(x) = e£(x)£(x)'e' + Eo, yxex. 



(7) 



= [I p OpxL 



P 




(8) 



Then, E(x) = QZ(x)£(x)'Q', Vx e X. 



Now that we have established that there exist decompositions of E(x) into the form specified 
by Equation (6), the question is whether we can specify a prior on the elements £(■), 0, and 
E that provides large support on such decompositions. This is explored in Section 2.3. 
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In order to generalize the model to also allow the mean n(x) to vary flexibly with 
predictors, we can follow a nonparametric latent factor regression approach and let 

r)i = ip(xi) + Pi, Vi ~ Wfc(0, h), (9) 

where ip(xi) = [ipi(xi), . . . ,ij)k(xi)]' , and tpj : X — > 5ft is an unknown function relating 
the predictors to the mean of the jth factor, for j = 1, . . . ,k. These functions can be 
modeled in a related manner to the ^fc(-) functions described above. The induced mean of yi 
conditionally on Xi = x and marginalizing out the latent factors is then fx(x) = Q£(x)t/)(x). 
For simplicity, however, we focus our discussions on the case where /z(x) = 0. 

2.3. Prior Specification 

Working within a Bayesian framework, we place independent priors on £(•), 0, and So in 
Eq. (6) to induce a prior on Yix- Let IIj, lie, and Ii£ denote each of these independent 
priors, respectively. Recall that II £ denotes the induced prior on T,x- 

Aiming to capture covariances that vary continuously over X combined with the goal of 
maintaining simple computations for inference, we specify the dictionary functions as 

ett(-)~GP(0,c) (10) 

independently for all £, k, with c a squared exponential correlation function having c(£, £') = 
«p(-«||£-£'||§). 

To cope with the fact that the number of latent dictionary functions is a model choice 
we are required to make, we seek a prior lie that favors many values of being close to 
zero so that we may choose L larger than the expected number of dictionary functions (also 
controlled by the latent factor dimension k). As proposed in Bhattacharya and Dunson 
[2010], we use the following shrinkage prior: 

On I fat, n ~ Af(0, fa t ~ Ga(3/2, 3/2) 

e 

5i~Ga(oi, 1), 6h ~ Ga(a 2 , 1), h > 2, n = fa 

Choosing a2 > 1 implies that 5^ is greater than 1 in expectation so that Tn tends stochas- 
tically towards infinity as I goes to infinity, thus shrinking the elements 9jt toward zero 
increasingly as I grows. The fag precision parameters allow for flexibility in how the ele- 
ments of are shrunk towards zero by incorporating local shrinkage specific to each element 
of 0, while Tg provides a global column- wise shrinkage factor. 

Finally, we specify Ils via the usual inverse gamma priors on the diagonal elements of 
So- That is, 

aj 2 ~Ga(a CT ,6 CT ) (12) 

independently for each j = 1 , . . . , p. 

2.4. Theoretical Properties 

In this section, we explore the theoretical properties of the proposed Bayesian nonparametric 
covariance regression model. In particular, we focus on the support of the induced prior 



(11) 
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Lis based on the priors LI^, lie, an d Tl^ defined in Section 2.3. Large support implies 
that the prior can generate covariance regressions that are arbitrarily close to any function 
{£(#), x G X} in a large class. Such a support property is the defining feature of a Bayesian 
nonparametric approach and cannot simply be assumed. Often, seemingly flexible models 
can have quite restricted support due to hidden constraints in the model and not to real prior 
knowledge that certain values are implausible. Although we have chosen a specific form for 
a shrinkage prior lie , we aim to make our statement of prior support as general as possible 
and thus simply assume that lie satisfies a set of two conditions given by Assumption 2.1 
and Assumption 2.2. In Lemma 2.2, we show that the lie specified in Eq. (11) satisfies these 
assumptions. The proofs associated with the theoretical statements made in this section 
can be found in the Appendix. 

Assumption 2.1. lie is such that ^2tE[\6jt\] < oo. This property ensures that the 
prior on the rows of shrinks the elements towards zero fast enough as i — > oo . 

Assumption 2.2. lie is such that U e (rank(Q) = p) > 0. That is, there is positive 
prior probability of being full rank. 

The following theorem shows that, for k > p and as L — > oo, the induced prior Lis places 
positive probability on the space of all covariance functions S*(x) that are continuous on 



Theorem 2.1. Let Lis denote the induced prior on {E(x),x £ X} based on the specified 
prior LI^ <E> Lie ® LIs on X^ <E> Xq g§ X-^ . Assuming X compact, for all continuous 2* (a;) 
and for all e > 0, 



Intuitively, the support on continuous covariance functions S*(x) arises from the continuity 
of the Gaussian process dictionary functions. However, since we are mixing over infinitely 
many such dictionary functions, we need the mixing weights specified by to tend towards 
zero, and to do so "fast enough" — this is where Assumption 2.1 becomes important. Sec 
Theorem 2.2. The proof of Theorem 2.1 relies on the large support of Lis at any point 
xq g X. Since each £,ek{xo) is independently Gaussian distributed (based on properties 
of the Gaussian process prior), £(xo)£(xo)' i s Wishart distributed. Conditioned on 0, 
O£(xo)£(xo)'0' is also Wishart distributed. More generally, for fixed 0, 0£(x)£(x)'0' 
follows the matric-variate Wishart process of Gelfand et al. [2004]. Combining the large 
support of the Wishart distribution with that of the gamma distribution on the inverse 
elements of Eo provides the desired large support of the induced prior Lis at each predictor 
location xo- 

Theorem 2.2. For every finite k and L — > oo (or L finite), A(-) = 0£(-) is almost 
surely continuous on X. 

Lemma 2.2 specifics the conditions under which the prior Lie specified in Eq. (11) 
satisfies Assumption 2.1, which provides a sufficient condition used in the proof of prior 
support. 



X. 




(13) 
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Lemma 2.2. Based on the prior specified in Eq. (11) and choosing ai > 2, Assump- 
tion 2.1 is satisfied. That is, Eg < oo. 

It is also of interest to analyze the moments associated with the proposed prior. As 
detailed in the Appendix, the first moment can be derived based on the implied inverse 
gamma prior on the ct| combined with the fact that 0£(x)£(x)'0' is marginally Wishart 
distributed at every location x, with the prior on specified in Equation (11). 

Lemma 2.3. Let fi a denote the mean of en, j = 1, ...,]?. Then, 

E[E(x)] = diag ^k ^ ^u^ 1 + /v, . . . , ^t^ 1 + fi^j . (14) 



Since our goal is to develop a covariance regression model, it is natural to consider 
the correlation induced between an element of the covariance matrix at different predictor 
locations x and x' . 

Lemma 2.4. Let a%. denote the variance of a?, j = 1, . . . ,p. Then, 

^ . , ^ , _ / kc(x, x>) (5 <f>uV + (Ei <fe V 1 ) 2 ) + °l * = 3, 

cov^(x), £ y (x )) - I k<x> x>) ^ ^- V -i r - 2 + ^ -i r -i ^ ^-i T -i) z h 

(15) 

For any two elements Ejj(x) and T, uv (x') with i ^ u or j ^ v, 

cov(T, ij (x),'E uv (x')) = 0. (16) 



We can thus conclude that in the limit as the distance between the predictors tends 
towards infinity, the correlation decays at a rate defined by the Gaussian process kernel 
c(x, x') with a limit: 

^^^cov^,^),^^)) = { * 1 m 

It is perhaps counterintuitive that the correlation between £jj(x) and £,i(x') does not go 
to zero as the distance between the predictors x and x' tends to infinity. However, although 
the correlation between £(x) and £(x') goes to zero, the diagonal matrix Eo does not depend 
on x or x' and thus retains the correlation between the diagonal elements of £(x) and E(x'). 

Equation (15) implies that the autocorrelation ACF(x) = corr(£jj(0), £y (x)) is simply 
specified by c(0, x). When we choose a Gaussian process kernel c(x, x') = exp(— k||x— x'|||), 
we have 

ACF(x)=exp(-K\\x\\l). (18) 

Thus, we see that the length-scale parameter k directly determines the shape of the auto- 
correlation function. 

Finally, one can analyze the stationarity properties of the proposed covariance regression 
prior. 
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Lemma 2.5. Our proposed covariance regression model defines a first-order stationary 
process in that IIs(S(x)) = Vx,x' € X. Furthermore, the process is wide sense 

stationary: cou(£y (x), Y, uv (x')) solely depends upon \\x — x'\\. 

PROOF. The first-order stationarity follows immediately from the stationarity of the 
Gaussian process dictionary elements and recalling that = &£(x)£(x)'®' + So- 

Assuming a Gaussian process kernel c(x,x') that solely depends upon the distance between 
x and x' (as in Section 2.3), Equations (15)- (16) imply that the defined process is wide 
sense stationary. 



3. Posterior Computation 

3. 1. Gibbs Sampling with a Fixed Truncation Level 

Based on a fixed truncation level L* and a latent factor dimension k*, we propose a Gibbs 
sampler for posterior computation. The derivation of Step 1 is provided in the Appendix. 

Step 1 Update each dictionary function £^ m (-) from the conditional posterior given {yi}, 
O, {iji}, Tig. We can rewrite the observation model for the jth component of the ith response 
as 



fc* L* 

Uij = ^2 Vim X] Qjdlm(Xi) + £i 
m=l 1=1 



(19) 



Conditioning on £(■) lm = {£ r s(")i r ^> s m }i our Gaussian process prior on the dictio- 
nary functions implies the following conditional posterior 



£,lm{xi) 
^m(^n) 



mm ELi QjtVj 2 mj 



£ > 



(20) 



where jjij = yij—J2( r s)^(t m) 0jr£rs{xi) and, taking K to be the Gaussian process covariance 
matrix with K+j = c(xi ,Xj), 



^ 1 = K 1 + dia S I Vl m H ■■ 



-2 2 /)2 



(21) 



i=i 



S^ep i? Next we sample each latent factor r\i given j/,-, £(•), O, So- Recalling Eq. (3) and 
the fact that r\i ~ A/fc* (0, /fc*), 

»7i I Vi, s 

~A4* ((l + e(xOWEo 1 0^(^))" 1 ^)'0 / So 1 y i ,(/ + e(^)'e / S o - 1 e^ i ))" 1 ^ . (22) 
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Step 3 Let 9 3 . = \6ji ■■■ 0jL*]- Recalling the Ga(o CT ,& (T ) prior on each precision pa- 
rameter <jj 2 associated with the diagonal noise covariance matrix So, standard conjugate 
posterior analysis yields the posterior 



of I M.e^o-Gi 



(23) 



Step 4 Conditioned on the hypcrparameters <f> and r, the Gaussian prior on the elements of 
8 specified in Eq. (f 1) combined with the likelihood defined by Eq. (3) imply the following 
posterior for each row of 0: 









~yif 






4>,T~Af L * 



















(24) 



where fj' = [^(xi)r)i £(x 2 )r)2 



£(x n )r) n ] and 



cr- 77'?) + diag(^m, . . . , (t>jL*T L *)- 



(25) 



Siep 5 Examining Eq. (11) and using standard conjugate analysis results in the following 
posterior for each local shrinkage hypcrparameter <f>ji given 9jg and Tf. 



U ~Ga 2 



3 + t^V 



(26) 



Step 6 As in Bhattacharya and Dunson [2010], the global shrinkage hyperparameters are 
updated as 



e=i 3=1 



\ e=i 3=1 J 



where t\ ' l) = ]lt=i,t#/i S t for h = 1, . . . ,p. 



(27) 



3.2. Incorporating nonparametric mean n(x) 

If one wishes to incorporate a latent factor regression model such as in Eq. (9) to induce 
a predictor-dependent mean /i(x), the MCMC sampling is modified as follows. Steps 1, 3, 
4, 5, and 6 are exactly as before. Now, however, the sampling of r\i of Step 2 is replaced 
by a block sampling of ip( x i) and v^. Specifically, let f2i = 0£(xi). We can rewrite the 
observation model as yi = Qiip(xi) + Qji/; + e^. Marginalizing out yi = Cliipfai) + w. 
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with uji ~ A/"(0, S, = fijQ^ + So)- Assuming nonparametric mean vector components 
ipt(-) ~ GP(0,c), the posterior of V^(") follows analogously to that of £(•) resulting in 



I {w} ) ^(-)-',e,» 7 ^o),Eo~Ar n s v , 













I S^ 









(28) 



where ^ = yi — X](y/£) I^d-rVv^i)- Once again taking isT to be the Gaussian process 
covariancc matrix, 



(29) 



t- 1 = K- 1 + diag [p^t^l^U, . . . , [O^E" 1 ^]. 



Conditioned on ip(xi), we consider yi = yi — Qiip(xi) = fl;^ + e^. Then, using the fact 
that Ui ~^(0,J fc *), 

vi | K,i)>(ii),9,((i,),Eo 

((/ + e(x J )'e's 1 ee(x l )) _1 e(^)'e'So 1 ^,(/ + e(^)'e'So 1 ee(^)) _1 ) • (so) 



3.3. Hyperparameter Sampling 

One can also consider sampling the Gaussian process length-scale hyperparameter re. Due 
to the linear-Gaussianity of the proposed covariance regression model, we can analytically 
marginalize the latent Gaussian process random functions in considering the posterior of 
re. Once again taking fJ,(x) = for simplicity, our posterior is based on marginalizing the 
Gaussian process random vectors £^ TO = [£,em(xi) ■ ■ ■ £tm( x n)Y- Noting that 



[vi y'2 



n] ' = [ dia S( ? ?-™) ® H &rn + [e'i 4 



(31) 



Cm 



and letting K K denote the Gaussian process covariance matrix based on a length-scale re, 
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+ 



t.m 



S 
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S 
(32) 

We can then Gibbs sample re based on a fixed grid and prior p(k) on this grid. Note, 
however, that computation of the likelihood specified in Eq. (32) requires evaluation of 
an np-dimensional Gaussian for each value re specified in the grid. For large p scenarios, 
or when there are many observations y%, this may be computationally infeasible. In such 
cases, a naive alternative is to iterate between sampling £(•) given K K and K K given £(•). 
However, this can lead to extremely slow mixing. Alternatively, one can consider employing 
the recent Gaussian process hyperparameter slice sampler of Adams and Murray [2011]. 

In general, because of the quadratic mixing over Gaussian process dictionary elements, 
our model is relatively robust to the choice of the length-scale parameter and the computa- 
tional burden imposed by sampling re is typically unwarranted. Instead, one can pre-select 
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Table 1. Computations required at each Gibbs sampling step. 



Gibbs Update 


Computation 


Step 1 


L* x k* draws from an n dimensional Gaussian 


Step 2 


n draws from a k* dimensional Gaussian 


Step 3 


p draws from a gamma distribution 


Step 4 


p draws from an L* dimensional Gaussian 


Step 5 


p x L* draws from a gamma distribution 


Step 6 


L* draws from a gamma distribution 



a value for k using a data-driven heuristic, which leads to a quasi-empirical Bayes approach. 
Recalling Equation (18), we have 

-\og(ACF(x))=K\\x\\l (33) 

Thus, if one can devise a procedure for estimating the autocorrelation function from the 
data, one can set K accordingly. We propose the following. 

1. For a set of evenly spaced knots Xk G X ', compute the sample covariance E(xfc) from 
a local bin of data yk-k :k+k with fco > f>/2. 

2. Compute the Cholesky decomposition C(xk) = chol(E(xk))- 

3. Fit a spline through the elements of the computed C(xk). Denote the spline fit of the 
Cholesky by C{x) for each x E X 

4. For i = 1, . . . , n, compute a point-by-point estimate of E(xj) from the splines: Ti{x{) = 
C(xi)C(xi)'. 

5. Compute the autocorrelation function of each element Sjj(x) of this kernel-estimated 
E(ar). 

6. According to Equation (33), choose k to best fit the most correlated Ejj (x) (since less 
correlated components can be captured via weightings of dictionary elements with 
stronger correlation.) 

3.4. Computational Considerations 

In choosing a truncation level L* and latent factor dimension k* , there are a number of 
computational considerations. The Gibbs sampler outlined in Section 3.1 involves a large 
number of simulations from Gaussian distributions, each of which requires the inversion 
of an m-dimcnsional covariance matrix, with m the dimension of the Gaussian. For large 
m, this represents a large computational burden as the operation is, in general, 0(m 3 ). 
The computations required at each stage of the Gibbs sampler arc summarized in Table 1. 
From this table we sec that depending on the number of observations n and the dimension 
of these observations p, various combinations of L* and k* lead to more or less efficient 
computations. 

In Bhattacharya and Dunson [2010], a method for adaptively choosing the number of 
factors in a non-predictor dependent latent factor model was proposed. One could directly 
apply such a methodology for adaptively selecting L* . To handle the choice of k* , one could 
consider an augmented formulation in which 



A(a:) = Q£(x)r, 



(34) 
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where T = diag(7i, . . . ,7fc) is a diagonal matrix of parameters that shrink the columns of 
£(x) towards zero. One can take these shrinkage parameters to be distributed as 

i 

7^^(0,0, cji = l[( h 

h=i ^ 
Ci~Ga(a 3 ,l), 0i~Ga(a 4 ,l) h = 2,...,k. 

For 04 > 1, such a model shrinks the 7, values towards zero for large indices i just as 
in the shrinkage prior on 0. The 7^ close to zero provide insight into redundant latent 
factor dimensions. Computations in this augmented model are a straightforward extension 
of the Gibbs sampler presented in Section 3.1. Based on the inferred values of the latent ji 
parameters, one can design an adaptive strategy similar to that for L*. 

Note that in Step 1, the n-dimensional inverse covariance matrix E^ 1 which needs to 
be inverted in order to sample is a composition of a diagonal matrix and an inverse 
covariance matrix K~~ x that has entries that tend towards zero as ||xi — Xj\\2 becomes 
large (i.e., for distant pairs of predictors.) That is, K" 1 (and thus S^" 1 ) is nearly band- 
limited, with a bandwidth dependent upon the Gaussian process parameter n. Inverting 
a given n x n band- limited matrix with bandwidth d « n can be efficiently computed in 
0(m 2 d) [Kavcic and Moura, 2000] (versus the naive 0(m 3 )). Issues related to tapering the 
elements of K~ x to zero while maintaining positive-semidefmitcness are discussed in Zhang 
and Du [2008]. 

4. Simulation Example 

In the following simulation examples, we aim to analyze the performance of the proposed 
Bayesian nonparametric covariance regression model relative to competing alternatives in 
terms of both covariance estimation and predictive performance. Wc initially consider the 
case in which S(x) is generated from the assumed nonparametric Bayes model in Section 4.1 
and 4.2, while in Section 4.3 we simulate from a parametric model and compare to a Wishart 
matrix discounting method [Prado and West, 2010] over a set of replicates. 



4.1. Estimation Performance 

We simulated a dataset from the model as follows. The set of predictors is a discrete set 
X = {1, . . . , 100}, with a 10-dimcnsional observation yi generated for each Xi G X. The 
generating mechanism was based on weightings of a latent 5x4 dimensional matrix £(•) 
of Gaussian process dictionary functions (i.e, L = 5, k = 4), with length-scale k = 10 and 
an additional nugget effect adding le~ 5 I n to K. Here, we first scale the predictor space to 
(0, 1]. The additional latent mean dictionary elements ip(-) were similarly distributed. The 
weights O were simulated as specified in Eq. (11) choosing a\ = 02 = 10. The precision 
parameters a~ 2 were each drawn independently from a Ga(l,0.1) distribution with mean 
10. Figure 1 displays the resulting values of the elements of fi(x) and E(x). 

For inference, we set the hyperparameters as follows. We use truncation levels k* = 
L* = 10, which we found to be sufficiently large from the fact that the last few columns of 
the posterior samples of 6 were consistently shrunk close to 0. We set a\ = a-2 = 2 and 
placed a Ga(l,0.1) prior on the precision parameters crj 2 . The length-scale parameter k 
was set from the data according to the heuristic described in Section 3.3 using 20 knots 
evenly spaced in X = {1, . . . , 100}, and was determined to be 10 (after rounding). 



1 4 Fox and Dunson 




(a) 



(b) 



Fig. 1. Plot of each component of the (a) true mean vector n(x) and (b) true covariance matrix E(a;) 
over the predictor space X = {1, . . . , 100}, taken here to represent a time index. 
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(a) (b) (c) 

Fig. 2. Residuals between each component of the true and posterior mean of (a) the mean n(x), 
and (b) covariance E(x). The scaling of the axes matches that of Figure 1 . (c) Box plot of posterior 
samples of the noise covariance terms of for j = 1, . . . ,p compared to the true value (green). 




Although experimentally we found that our sampler was insensitive to initialization 
in lower-dimensional examples such as the one analyzed here, we use the following more 
intricate initialization for consistency with later experiments on larger datasets in which 
mixing becomes more problematic. The predictor-independent parameters and Eo are 
sampled from their respective priors (first sampling the shrinkage parameters 4>jt and Sh 
from their priors). The variables ?y, and are set via a data-driven initialization scheme 
in which an estimate of E(:Cj) for i = l,...,n is formed using Steps 1-4 of Section 3.3. 
Then, Q^(xi) is taken to be a fc*-dimcnsional low-rank approximation to the Cholcsky 
of the estimates of £(x,). The latent factors rji are sampled from the posterior given in 
Equation (22) using this data-driven estimate of 0£(xi). Similarly, the £(#,) are initially 
taken to be spline fits of the pseudo-inverse of the low-rank Cholesky at the knot locations 
and the sampled 0. We then iterate a couple of times between sampling: (i) £(■) given 
{yi}, Q, Eo, and the data-driven estimates of rj, £(•); (ii) 9 given {?/,:}, Eo, rj, and the 
sampled £(•); (hi) Eo given {?/,}, 8, rj, and £(•); and (iv) determining a new data-driven 
approximation to £(•) based on the newly sampled 0. Results indistinguishable from those 
presented here were achieved (after a short burn-in period) by simply initializing each of 0, 
£(•), Eo, r)i, and the shrinkage parameters and Sh from their respective priors. 

We ran 10,000 Gibbs iterations and discarded the first 5,000 iterations. We then thinned 
the chain every 10 samples. The residuals between the true and posterior mean over all 
components are displayed in Figure 2(a) and (b). Figure 2(c) compares the posterior samples 
of the elements a? of the noise covariance E to the true values. Finally, in Figure 3 we 
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Fig. 3. Plots of truth (red) and posterior mean (green) for select components of the mean n P (x) (left), 
variances T, pp (x) (middle), and covariances E M (x) (right). The point-wise 95% highest posterior 
density intervals are shown in blue. The top row represents the component with the lowest L2 error 
between the truth and posterior mean. Likewise, the middle row represents median L2 error and 
the bottom row the worst L2 error. The size of the box indicates the relative magnitudes of each 
component. 

display a select set of plots of the true and posterior mean of components of /i(x) and £(x), 
along with the 95% highest posterior density intervals computed at each predictor value 
x = 1,...,100. 

From the plots of Figures 2 and 3, we see that we are clearly able to capture het- 
eroscedasticity in combination with a nonparametric mean regression. The true values of 
the mean and covariance components are (even in the worst case) contained within the 95% 
highest posterior density intervals, with these intervals typically small such that the overall 
interval bands are representative of the shape of the given component being modeled. 

4.2. Predictive Performance 

Capturing heteroscedasticity can significantly improve estimates of the predictive distri- 
bution of new observations or missing data. To explore these gains within our proposed 
Bayesian nonparametric covariance regression framework, we compare against two possi- 
ble homosccdastic formulations that each assume y ~ Af(n(x), E). The first is a standard 
Gaussian process mean regression model with each element of fx(x) an independent draw 
from GP(0, c). The second builds on our proposed regularized latent factor regression model 
and takes fjt(x) = Q^(x)^(x), with {Q,£(x),ip(x)} as defined in the heteroscedastic case. 
However, instead of having a predictor-dependent covariance E(x) = Q£(x)£(x)'Q' + E , 
the homosccdastic model assumes that E is an arbitrary covariance matrix constant over 
predictors. By comparing to this latter homoscedastic model, we can directly analyze the 
benefits of our heteroscedastic model since both share exactly the same mean regression 
formulation. For each of the homoscedastic models, we place an inverse Wishart prior on 
the covariance E. 

We analyze the same simulation dataset as described in Section 4.1, but randomly remove 
approximately 5% of the observations. Specifically, independently for each clement yij (i.e., 
the jth response component at predictor x,-) we decide whether to remove the observation 
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Table 2. Average Kullback-Leibler divergence D K L(Pi, m \\Qi), i = 1, • • • , 100, where P^ m and Q, 
are the predictive distributions of all missing elements y tJ given the observed elements of %n based 
on the mth posterior sample of and true parameters fi(xi) and E(aij), respectively. We compare the 
predictive performance for two homoscedastic models to our covariance regression framework. 



Model 


Average Posterior Predictive KL Divergence 


Homoscedastic Mean Regression 
Homoscedastic Latent Factor Mean Regression 
Heteroscedastic Mean Regression 


0.3409 
0.2909 
0.1216 



based on a Bernoulli(pi) draw. Wc chose pi to be a function of the matrix norm of the 
true covariance at a;, to slightly bias towards removing values from predictor regions with a 
tighter distribution. This procedure resulted in removing 48 of the 1000 response elements. 

Table 2 compares the average Kullback-Leibler divergence DKL{Pi.m\\Qi); i = 1,...,100, 
for the following definitions of Pi. m and Qi. The distribution Qi is the predictive distribu- 
tion of all missing elements yij given the observed elements of yi under the true parameters 
)i(xi) and Ti(xi). Likewise, P,. m is taken to be the predictive distribution based on the 
mth posterior sample of fi(xi) and S(a;j). In this scenario, the missing observations y^ are 
imputed as an additional step in the MCMC computations §. The results, once again based 
on 10,000 Gibbs iterations and discarding the first 5,000 samples, clearly indicate that our 
Bayesian nonparametric covariance regression model provides more accurate predictive dis- 
tributions. We additionally note that using a regularized latent factor approach to mean 
regression improves on the naive homoscedastic model in high dimensional datasets in the 
presence of limited data. Not depicted in this paper due to space constraints is the fact 
that the proposed covariance regression model also leads to improved estimates of the mean 
Ijl{x) in addition to capturing hcterosccdasticity. 

4.3. Model Mismatch 

We now examine our performance over a set of replicates from a 30-dimensional parametric 
heteroscedastic model. To generate the covariance over X = {1, . . . , 500}, we chose a 
set of 5 evenly spaced knots Xk = 1, 125, 250, 375, 500 and generated 

S(x fc )~AA(0,£ s ) (36) 

with S s = Yjjti s i s 'j and s 3 ~ AA([-29 - 28 ... 28 29]', J 30 ). This construction implies 
that S(xk) and S(x' k ) are correlated. We then fit a spline 5V, (-) independently through each 
element Sij(xk) and evaluate this spline fit at x = 1, ... , 500. The covariance is constructed 
as 

E(a?) = aS(x)S(x)' + S , (37) 

where So is a diagonal matrix with a truncated-normal prior, TAf (0,1), on its diagonal 
elements. The constant a is chosen to scale the maximum value of aS(x)S(x)' to 1. The 
resulting covariance is shown in Figure 4(a). 

§Note that it is not necessary to impute the missing j/y within our proposed Bayesian covariance 
regression model because of the conditional independencies at each Gibbs step. In Section 5, we 
simply sample based only on actual observations. Here, however, we impute in order to directly 
compare our performance to the homoscedastic models. 
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(a) (b) (c) 

Fig. 4. (a) Plot of each component of the true covariance matrix E(x) over the predictor space 
X — {1, . . . , 500}, taken to represent a time index. Analogous plot for the mean estimate of E(i) are 
shown for (b) our proposed Bayesian nonparametric covariance regression model based on Gibbs 
iterations 5000 to 10000, and (c) a Wishart matrix discounting model over 100 independent FFBS 
samples. Both mean estimates of E(a;) are for a single replicate {y t (1) }. Note that the scaling of (c) 
crops the large estimates of H(x) for x near 500. 

Each replicate m = 1, . . . , 30 of this parametric heteroscedastic model is generated as 

y,["^ ~ N (0, S(a;,)). (38) 

Our hypcrparameters and initialization scheme are exactly as in Section 4. The only 
difference is that we use truncation levels k* = L* = 5 based on an initial analysis with 
k* = L* = 17. For each replicate, we once again run 10,000 Gibbs iterations and thin the 
chain by examining every 10th sample. A mean estimate of is displayed in Figure 4(b). 
In Figure 5, we plot the mean and 95% highest posterior density intervals of the Frobenius 
norm ||E( r ' m )(a;) — S(a;)||2 aggregated over iterations r = 9,000, . . . , 10,000 and replicates 
m = 1, . . . , 30. The average norm error over X is around 3, which is equivalent to each 
element of the inferred £( r,m )(x) deviating from the true T,(x) by 0.1. Since the covariance 
elements are approximately in the range of [—1,1] and the variances in [0,3], these norm 
error values indicate very good estimation performance. 

We compare our performance to that of the Wishart matrix discounting model (see 
Section 10.4.2 of Prado and West [2010]), which is commonly used in stochastic volatility 
modeling of financial time series. Let $t = S^ 1 . The Wishart matrix discounting model 
is a discrete-time covariance evolution model that accounts for the slowly changing covari- 
ance by discounting the cumulated information. Specifically, assume $t-i | Vi-.t—iiP ~ 
W{h t -\,Dt\), with D t = /3D t -i + yty't and h t = /3h t -i + 1. The discounting model then 
specifies 

$ t | yi .. t -i,P ~ W(fifh-i, (PDt-iV 1 ) (39) 

such that E[<& t \ Vi-.t-i] = E[$t-i I Vut-i] = ht-xD^_ x , but with certainty discounted 
by a factor determined by /?. The update with observation y t is conjugate, maintaining a 
Wishart posterior on $ t . A limitation of this construction is that it constrains h t > p — 1 (or 
h t integral) implying that j3 > (p — 2)/{p — 1). We set h n = 40 and ,8 = 1— 1/ho such that 
h t = 40 for all t and ran the forward filtering backward sampling (FFBS) algorithm outlined 
in Prado and West [2010], generating 100 independent samples. A mean estimate of S(x) 
is displayed in Figure 4(c) and the Frobenius norm error results are depicted in Figure 5. 
Within the region x = 1, . . . , 400, we sec that the error of the Wishart matrix discounting 
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Fig. 5. (a) Plot of the mean and 95% highest posterior density intervals of the Frobenius norm 
||E (T ' m) (x) _ E(x)||2 for the proposed Bayesian nonparametric covariance regression model (blue 
and green) and the Wishart matrix discounting model (red and green). The results are aggregated 
over 1 00 posterior samples and replicates m = 1, . . . , 30. For the Bayesian nonparametric covari- 
ance regression model, these samples are taken at iterations r = [9000 : 10 : 10000]. (b) Analogous 
plot, but zoomed in to more clearly see the differences over the range of x = 1, ... , 400. 

method is approximately twice that of our proposed methodology. Furthermore, towards the 
end of the time series (interpreting X as representing a batch of time), the estimation error 
is especially poor due to errors accumulated in forward filtering. Increasing ht mitigates 
this problem, but shrinks the model towards homosccdasticity. In general, the formulation 
is sensitive to the choice of h t , and in high-dimensional problems this degree of freedom is 
forced to take large (or integral) values. 



5. Applications 

We applied our Bayesian nonparametric covariance regression model to the problem of cap- 
turing spatio-temporal structure in influenza rates in the United States (US). Surveillance 
of influenza has been of growing interest following a series of pandemic scares (e.g., SARS 
and avian flu) and the 2009 H1N1 pandemic, previously known as "swine flu". Although 
influenza pandemics have a long history, such as the 1918-1919 "Spanish flu", 1957-1958 
"Asian flu", and 1968-1969 "Hong Kong flu", a convergence of factors are increasing the 
current public interest in influenza surveillance. These include both practical reasons such 
as the rapid rate by which geographically distant cases of influenza can spread worldwide, 
along with other driving factors such as an increased media coverage. 



5.1. CDC Influenza Monitoring 

The surveillance of influenza within the US is coordinated by the Centers for Disease Con- 
trol and Prevention (CDC), which collects data from a large network of diagnostic labora- 
tories, hospitals, clinics, individual healthcare providers, and state health departments (see 
http://www.cdc.gov/flu/weekly/). The approximately 3,000 participating outpatient sites, 
collectively referred to as the US Outpatient Influenza- Like Illness Surveillance Network 
(ILINet), provide the CDC with key information about rates of influenza-like illness (ILI)^f. 
The CDC consolidates the ILINet observed cases and produces reports for 10 geographic 
regions in addition to a US aggregate rate based on a population-based weighted average of 

% An influenza-like illness (ILI) is defined as any case of a person having over 100 degrees Fahren- 
heit fever along with a cough and/or sore throat in absence of any other known cause. 
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Fig. 6. (a) Plot of the number of isolates tested positive by the WHO and NREVSS over the period 
of September 29, 2003 to May 23, 2010. The isolates are divided into virus subtypes, specifically 
influenza A (H3N2, H1 = {H1N2 and H1N1}, 2009 H1N1) and influenza B. The isolates where 
subtyping was not performed or was not possible are also indicated, (b) Plot of posterior means of 
the nonparametric mean function fj,j(x) for each of the 183 states and regions in the Google Trends 
Flu dataset. The thick yellow line indicates the Google Flu Trends estimate of the United States 
influenza rates, (c) For New York, the 25th, 50th, and 75th quantiles of correlation with the 1 82 other 
states and regions based on the posterior mean t,(x) of the covariance function. The black line is a 
scaled version of the United States influenza rates, as in (b), shown for easy comparison. The green 
line shown in plots (a)-(c) indicates the time periods determined to flu events. Specifically, Event A 
corresponds to the 2003-2004 flu season (flu shot shortage), Event B the 2004-2005 season, Event 
C the 2005-2006 season (avian flu scare), Event D the 2006-2007 season, Event E the 2007-2008 
season (severe), and Event F the 2009-201 season (2009 H1 N1 or "swine flu"). 

state-level rates. The CDC weekly flu reports are typically released after a 1-2 week delay 
and are subject to retroactive adjustments based on corrected ILINet reports. 

A plot of the number of isolates tested positive by the WHO and NREVSS from 2003- 
2010 is shown in Figure 6(a). From these data and the CDC weekly flu reports, wc defined 
a set of six events (Events A-F) corresponding to the 2003-2004, 2004-2005, 2005-2006, 

2006- 2007, 2007-2008, and 2009-20f0 flu seasons, respectively. The 2003-2004 flu season 
began earlier than normal, and coincided with a flu vaccination shortage in many states. 
For the vaccination that was available, the CDC found that it was "not effective or had very 
low effectiveness" (http:/ /www. cdc.gov/media/pressrel/fs040f 15.htm). The 2004-2005 and 

2007- 2008 flu seasons were more severe than the 2005-2006 and 2006-2007 seasons. However, 
the 2005-2006 season coincided with an avian flu (H5N1) scare in which Dr. David Narbarro, 
Senior United Nations System Coordinator for Avian and Human Influenza, was famously 
quoted as predicting that an avian flu pandemic would lead to 5 million to 150 million 
deaths. Finally, the 2009-2010 flu season coincides with the emergence of the 2009 HINI 
("swine flu") subtype|| in the United States. 



5.2. Google Flu Trends Dataset 

To aid in a more rapid response to influenza activity, a team of researchers at Google devised 
a model based on Google user search queries that is predictive of CDC ILI rates [Ginsberg 
et al., 2008] — that is, the probability that a random physician visit is related to an influenza- 

|| According to the CDC, "Antigenic characterization of 2009 influenza A (H1N1) viruses indicates 
that these viruses are only distantly related antigenically and genetically to seasonal influenza A 
(HINI) viruses". See http://www.cdc.gov/flu/weekly/weeklyarchives2009-2010/weekly20.htm. 
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like illness. The Google Flu Trends methodology was devised as follows. From the hundreds 
of billions of individual searches from 2003-2008, time series of state-based weekly query 
rates were created for the 50 million most common search terms. The predictive performance 
of a regression on the logit-transformed query rates was examined for each of the 50 million 
candidates and a ranked list was produced that rewarded terms predictive of rates exhibiting 
similar regional variations to that of the CDC data. A massive variable selection procedure 
was then performed to find the optimal combination of query words (based on best fit 
against out-of-sample ILI data), resulting in a final set of 45 ILI-related queries. Using the 
45 ILI-related queries as the explanatory variable, a region-independent univariate linear 
model was fit to the weekly CDC ILI rates from 2003-2007. This model is used for making 
estimates in any region based on the ILI-related query rates from that region. The results 
were validated against the CDC data both on training and test data, with the Google 
reported US and regional rates closely tracking the actual reported rates. 

A key advantage of the Google data (available at http:/ /www. google.org/flutrends/) is 
that the ILI rate predictions are available 1 to 2 weeks before the CDC weekly reports are 
published. Additionally, a user's IP address is typically connected with a specific geographic 
area and can thus provide information at a finer scale than the 10-regional and US aggregate 
reporting provided by the CDC. Finally, the Google reports are not subject to revisions. 
One important note is that the Google Flu Trends methodology aims to hone in on searches 
and rates of such searches that are indicative of influenza activity. A methodology based 
directly on raw search queries might instead track general interest in influenza, waxing and 
waning quickly with various media events. 

We analyze the Google Flu Trends data from the week of September 28, 2003 through 
the week of October 24, 2010, providing 370 observation vectors yi. Each observation vector 
is 183-dimensional with elements consisting of Google estimated ILI rates at the US national 
level, the 50 states, 10 U.S. Department of Health & Human Services surveillance regions, 
and 122 cities. It is important to note, however, that there is substantial missing data with 
entire blocks of observations unavailable (as opposed to certain weeks sporadically being 
omitted). At the beginning of the examined time frame only 114 of the 183 regions were 
reporting. By the end of Year 1, there were 130 regions. These numbers increased to 173, 
178, 180, and 183 by the end of Years 2, 3, 4, and 5, respectively. The high-dimensionality 
and missing data structure make the Google Flu Trends datasct challenging to analyze in 
its entirety with existing heteroscedastic models. As part of an exploratory data analysis, 
in Figure 7 we plot sample estimates of the geographic correlation structure between the 
states during an event period for four representative states. Specifically, we first subtract 
a moving average estimate of the mean (window size 10) and then aggregate the data 
over Events B-F, omitting Event A due to the quantity of missing data. Because of the 
dimensionality of the data (183 dimensions) and the fact that there are only 157 event 
observations, we simply consider the state-level observations (plus District of Columbia), 
reducing the dimensionality to 51. The limited data also impedes our ability to perform 
time-specific sample estimates of geographic correlations. 

5.3. Heteroscedastic Modeling of Google Flu Trends 

Our proposed heteroscedastic model allows one to capture both spatial and temporal 
changes in correlation structure, providing an important additional tool in predicting in- 
fluenza rates. We specifically consider yi ~ J\f((i(xi), T,(xi)) with the nonparametric func- 
tion fj,(xi) = Q£(xi)ip(xi) defining the mean of the ILI rates in each of the 183 regions. For 
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Fig. 7. For each of four geographically distinct states (New York, California, Georgia, and South 
Dakota), plots of correlations between the state and all other states based on the sample covariance 
estimate from aggregating the state-level data during the event periods B-F after subtracting a moving 
average estimate of the mean. Event A was omitted due to an insufficient number of states reporting. 
Note that South Dakota is missing 58 of the 157 Event B-F observations. 

a given week Xi, the spatial correlation structure is captured by the covariance £(xj) = 
<d£_(xi)^(xi)'Q' + So- Temporal changes are implicitly modeled through the proposed co- 
variance regression framework that allows for continuous variations in Dukic et al. 
[2009] also examine portions of the Google Flu Trends data, but with the goal of on-line 
tracking of influenza rates on either a national, state, or regional level. Specifically, they 
employ a state-space model with particle learning. Our goal differs considerably. We aim to 
jointly analyze the full 183-dimcnsional data, as opposed to univariate modeling. Through 
such joint modeling, we can uncover important spatial dependencies lost when analyzing 
components of the data individually. Such spatial information can be key in predicting in- 
fluenza rates based on partial observations from select regions or in retrospectively imputing 
missing data. 

There are a few crucial points to note. The first is that no geographic information is 
provided to our model. Instead, the spatial structure is uncovered simply from analyzing the 
raw 183-dimensional time series and patterns therein. Second, because of the substantial 
quantity of missing data, imputing the missing values as in Section 4.2 is less ideal than 
simply updating our posterior based solely on the data that is available. The latter is how 
we chose to analyze the Google Flu Trends data — our ability to do so without introducing 
any approximations is a key advantage of our proposed methodology. 

The results presented in Figures 6 and 8 clearly demonstrate that we are able to capture 
both spatial and temporal changes in correlations in the Google Flu Trends data, even in the 
presence of substantial missing information. We preprocessed the data by scaling the entire 
dataset by one over the largest variance of any of the 183 time series. The hyperparameters 
were set as in the simulation study of Section 4, except with larger truncation levels L* = 10 
and k* = 20 and with the Gaussian process length-scale hyperparameter set to K = 100 
to account for the time scale (weeks) and the rate at which ILI incidences change. Once 
again, by examining posterior samples of O, we found that the chosen truncation level 
was sufficiently large. We ran 10 chains each for 10,000 MCMC iterations, discarded the 
first 5,000 for burn-in, and then thinned the chains by examining every 10 samples. Each 
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Fig. 8. For each of four geographically distinct states (New York, California, Georgia, and South 
Dakota) and each of three key dates (February 2006 of Event C, February 2008 of Event E, and 
November 2009 of Event F), plots of correlations between the state and all other states based on the 
posterior mean t,(x) of the covariance function. The plots clearly indicate spatial structure captured 
by T,(x), and that these spatial dependencies change over time. Note that no geographic information 
was included in our model. Compare to the maps of Figure 7. 

chain was initialized with parameters sampled from the prior. To assess convergence, we 
performed the modified Gelman-Rubin diagnostic of Brooks and Gelman [1998] on the 
MCMC samples of the variance terms Hjj(xi) for j corresponding to the state indices of 
New York, California, Georgia, and South Dakota, and Xi corresponding to the midpoint 
of each of the 12 event and non-event time windows**. These elements are spatially and 
geographically disparate, with South Dakota corresponding to an element with substantial 
missing data. Of the 48 resulting variables examined (4 states and 12 time points), 40 had 
potential scale reduction factors i? 1 / 2 < 1.2 and most i? 1 / 2 < 1.1. The variables with larger 
R 1 / 2 (all less than 1.4) corresponded to two non-event time periods. We also performed 
hypcrparameter sensitivity, doubling the length-scale parameter to k = 200 (implying less 



**Our tests used the code provided at http://www.lce.hut.fi/research/mm/mcmcdiag/. 
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temporal correlation) and using a larger truncation level of L* = k* = 20 with less stringent 
shrinkage hyperparameters a\ = a 2 = 2 (instead of a\ = ai — 10). The results were very 
similar to those presented in this section, with all conclusions remaining the same. 

In Figure 6(b), wc plot the posterior mean of the 183 components of n(x), showing 
trends that follow the Google estimated US national ILI rate. For New York, in Figure 6(c) 
we plot the 25th, 50th, and 75th quantiles of correlation with the 182 other states and 
regions based on the posterior mean S(a;) of the covariance function. From this plot, we 
immediately notice that regions become more correlated during flu seasons, as we would 
expect. The specific geographic structure of these correlations is displayed in Figure 8, 
where wc see key changes with the specific flu event. In the more mild 2005-2006 season, 
we see much more local correlation structure than the more severe 2007-2008 season (which 
still maintains stronger regional than distant correlations.) The November 2009 H1N1 event 
displays overall regional correlation structure and values similar to the 2007-2008 season, but 
with key geographic areas that are less correlated. Note that some geographically distant 
states, such as New York and California, are often highly correlated as we might expect 
based on their demographic similarities and high rates of travel between them. Interestingly, 
the strong local spatial correlation structure for South Dakota in February 2006 has been 
inferred before any data are available for that state. Actually, no data are available for South 
Dakota from September 2003 to November 2006. Despite this missing data, the inferred 
correlation structures over these years are fairly consistent with those of neighboring states 
and change in manners similar to the flu-to-non-flu changes inferred after data for South 
Dakota are available. 

Comparing the maps of Figure 8 to those of the sample-based estimates in Figure 7, wc 
see much of the same correlation structure, which at a high level validates our findings. Since 
the sample-based estimates aggregate data over Events B-F (containing those displayed in 
Figure 8) , they tend to represent a time-average of the event-specific correlation structure we 
uncovered. Note that due to the dimensionality of the dataset, the sample-based estimates 
are based solely on state-level measurements and thus are unable to harness the richness (and 
crucial redundancy) provided by the other regional reporting agencies. Furthermore, since 
there are a limited number of per-event observations, the naive approach of forming sample 
covariances based on local bins of data is infcasible. The high-dimensionality and missing 
data structure of this dataset also limit our ability to compare to alternative methods such as 
those cited in Section 1 — none yield results directly comparable to the full analysis we have 
provided here. Instead, they are either limited to examination of the small subset of data for 
which all observations are present and/or a lower-dimensional selection (or projection) of 
observations. On the other hand, our proposed algorithm can readily utilize all information 
available to model the heteroscedasticity present here. (Compare to the common GARCH 
models, which cannot handle missing data and are limited to typically no more than 5 
dimensions.) In terms of computational complexity, each of our chains of 10,000 Gibbs 
iterations based on a naive implementation in MATLAB (R2010b) took approximately 12 
hours on a machine with four Intel Xeon X5550 Quad-Core 2.67GHz processors and 48 GB 
of RAM. 



6. Discussion 

In this paper, we have presented a Bayesian nonparametric approach to covariance regres- 
sion which allows an unknown p x p dimensional covariance matrix £(x) to vary flexibly 
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over x € X, where X is some arbitrary (potentially multivariate) predictor space. Foun- 
dational to our formulation is quadratic mixing over a collection of dictionary elements, 
assumed herein to be Gaussian process random functions, defined over X . By inducing a 
prior on E# = {H(x),x € X} through a shrinkage prior on a predictor-dependent latent 
factor model, we are able to scale to the large p domain. Our proposed methodology also 
yields computationally tractable algorithms for posterior inference via fully conjugate Gibbs 
updates — this is crucial in our being able to analyze high-dimensional multivariate datasets. 
We demonstrated the utility of our Bayesian covariance regression framework through both 
simulated studies and analysis of the Google Trends flu dataset, the latter having nearly 
200 dimensions. 

There are many possible interesting and relatively direct extensions of the proposed 
covariance regression framework. The most immediate are those that fall into the categories 
of (i) addressing the limitations of our current assumption of Gaussian marginals, and (ii) 
scaling to datasets with large numbers of observations. 

In longitudinal studies or spatio-temporal datasets, one is faced with repeated collections 
of observations over the predictor space. These collections arc clearly not independent. To 
cope with such data, one could envision embedding the current framework within a hier- 
archical model (e.g., to model random effects on a mean), or otherwise use the proposed 
methodology as a building block in more complicated models. Additionally, it would be 
trivial to extend our framework to accommodate multivariate categorical responses, or joint 
categorical and continuous responses, by employing the latent variable probit model of Al- 
bert and Chib [1993]. This would lead, for example, to a more flexible class of multivariate 
probit models in which the correlation between variables can change with time and other 
factors. For computation, the use of parameter expansion allows us to simply modify our 
current MCMC algorithm to include a data augmentation step for imputing the underlying 
continuous variables. Imposing the constraints on the covariance could be deferred to a 
post-processing stage. Another interesting direction for future research is to consider em- 
bedding our covariance regression model in a Gaussian copula model. One possible means 
of accomplishing this is through a variant of the approach proposed by Hoff [2007], which 
avoids having to completely specify the marginal distributions. 

As discussed in Section 3.4, our sampler relies on L * xfc* draws from an n-dimensional 
Gaussian (i.e., posterior draws of our Gaussian process random dictionary functions). For 
very large n, this becomes infeasible in practice since computations are, in general, C>(n 3 ). 
Standard tools for scaling up Gaussian process computation to large datasets, such as co- 
variance tapering [Kaufman et al., 2008, Du et al., 2009] and the predictive process [Banerjee 
et al., 2008], can be applied directly in our context. Additionally, one might consider using 
the integrated nested Laplace approximations of Rue et al. [2009] for computations. The 
size of the dataset (both in terms of n and p) also dramatically affects our ability to sample 
the Gaussian process length-scale hyperparamcter k since our proposed method relies on 
samples from an rip-dimensional Gaussian. See Section 3.3 for details and possible methods 
of addressing this issue. If it is feasible to perform inference over the length-scale parameter, 
one can consider implicitly including a test for homoscedasticity by considering n taking 
values in the extended reals (i.e., 5i[J{oo}) and thus allowing our formulation to collapse 
on the simpler model if k = oo. 

Finally, we note that there are scenarios in which the functional data itself is covariance- 
valued, such as in diffusion tensor imaging. In this case, each voxel in an image consists 
of a 3 x 3 covariance matrix that has potential spatio-temporal dependencies. Specifically, 
take to represent the observed covariance matrix for subject i at pixel j and time t. 
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Here, one could imagine replacing the Gaussian process dictionary elements with splines and 
embedding this model within a hierarchical structure to allow variability among subjects 
while borrowing information. 

As we readily see, the presented Bayesian nonparametric covariance regression frame- 
work easily lends itself to many interesting directions for future research with the potential 
for dramatic impact in many applied domains. 
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A. Proofs of Theorems and Lemmas 

Proof (Proof of Theorem 2.1). Since X is compact, for every e > there exists 
an open covering of eo-balls B eo (xo) = {x : \\x — X0W2 < e o} with a finite subcover such that 
Ux £X B co( x o) 3 X , where \X \ = n. Then, 



n E sup||E(s)-E*(ar)|| 2 <e =n E max sup ||X3(x) - S*(x)|| 2 < e . (40) 

\x<£X J \x eX x £B €0 (x ) J 

Define Z(x ) = sup^^^ ||E(x) - E*(x)|| 2 . Since 



n E max Z(x ) < e > n s (Z{x ) < e) > 0, Vx G X , (41) 

\x GX, J 

we only need to look at each eo-ball independently, which we do as follows, 
lis ( sup ||S(a:)-E*(x)||2<e) 

\x&B €0 (x ) J 

>n E ( ||E(x )-E*( a ;o)|| 2 + sup ||E*(a:o)-E*(a:)||2+ sup ||E(x ) - E(z)|| 2 < e ) 
>n s (||E( a;o )~E*(x )|| 2 <e/3) 

•n s ( sup ||£*(xo)-E*(x)|| a <e/3J .n s [ sup ||E(x ) - E(x)|| 2 < e/3 ] , 

\xeB ca (x ) J \x£B ca (x ) J 

(42) 

where the first inequality comes from repeated uses of the triangle inequality, and the second 
inequality follows from the fact that each of these terms is an independent event. We 
evaluate each of these terms in turn. The first follows directly from the assumed continuity 
of E*(-). The second will follow from a statement of (almost sure) continuity of E(-) that 
arises from the (almost sure) continuity of the ~ GP(0,c) and the shrinkage prior on 

8(k (i.e., 8ik — > almost surely as £ — > 00, and does so "fast enough".) Finally, the third will 
follow from the support of the conditionally Wishart prior on E(a?o) at every fixed xq G X. 
Based on the continuity of E*(-), for all e/3 > there exists an eo.i > such that 

||S*(as ) — S*(sb)I|2 < e/3, V||ar - z || 2 < e ,i- (43) 
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Therefore, n E (sup^ ^ ( , o) ||E*(.t ) - E*(x)|| 2 < e/3) = 1. 

Based on Theorem 2.2, each element of A(-) = 0£(-) is almost surely continuous on X 
assuming k finite. Letting ffjjt(x) = [A(x)]jfc, 

k 

[A(x)A(x)']ij = 9im(x)g jm (x), Vx e X. (44) 

m— 1 

Eq. (44) represents a finite sum over pairwise products of almost surely continuous functions, 
and thus results in a matrix A(x)A(x)' with elements that are almost surely continuous on 
X. Therefore, E(x) = A(x)A(x)' + S = Q£(x)£(x)'& + E is almost surely continuous on 
X. We can then conclude that for all e/3 > there exists an eo, 2 > such that 

n s ( sup ||E(xo)-E(x)|| 2 < e/3 ] = 1. (45) 

To examine the third term, we first note that 
n s (||E(x )-E*(x )|| 2 <e/3) 

= n s (\\Qs( Xo )t(x y& + s - ev(so)r(*o)'e*' + e*|| 2 < e/3) , (46) 



where {£*(xo), 9*, Eq} is any element of X^Xq^>X^ such that E*(x ) = 9*£*(xo)£*(xo)'0* 
Eq. Such a factorization exists by Lemma 2.1. We can then bound this prior probability by 

n s (||E(x )-E*(x )|| 2 < e/3) 

> n E (\\9t(xo)Z(xo)'Q' - e*e(x )C(x ye*'\\ 2 < e/e) n So (||e - e*|| 2 < e/e) 

> n s (||ec(x )c(xo)'e' - e*rO=o)£* (ao)'e*'|| 2 < e/e) n Eo (||s - esiu < e/(6V5F)) , 

(47) 

where the first inequality follows from the triangle inequality, and the second from the 
fact that for all A <G 5R pxp , ||A|| 2 < V^H^IU' with tlie sup-norm defined as \\A\loo = 

maxKKp ^[ =1 \o-ij\- Since Eo = diag(crf, . . . , er^) with of l '^ > " Ga(a (T ,6 CT ), the support of 
the gamma prior implies that 

n Eo (||E - ES| |oo < e/(6V5F)) = n So (max \af - < 2 | < e/(6^)) > 0. (48) 

Recalling that [£(xo)]ffc = &fc(xo) with £ek(xo) ' ~ ' Af(0, 1) and taking 6 e 5R pxi with 
rank(9) = p, 

0£(xo)£(xo)'0' I e ~ w(fc, ee'). (49) 

When k > p, 0£(xo)£(xo)'0' is invertible (i.e., full rank) with probability 1. 

By Assumption 2.2, there is positive probability under Lie on the set of such that 
rank(0) = p. Since O*£*(xo)£*(xo)'0* is an arbitrary symmetric positive semidcfimtc 
matrix in W xp , and based on the support of the Wishart distribution, 

n s (\m(x a )ax YQ' - Q*e(x )e(x ye*'\\2 < e/e) > o. (50) 
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We thus conclude that II E (||S(x ) - S*(x )|| 2 < e/3) > 0. 

For every £*(•) and e > 0, let eo = min(eo,i, eo, 2 ) with eo,i and eo,2 defined as above. 
Then, combining the positivity results of each of the three terms in Eq. (42) completes the 
proof. 

Proof (Proof of Theorem 2.2). We can represent each element of A(-) as follows: 



[A(-)]ifc = lim 

L— ¥00 



711 (712 
5*21 $22 

1 ■pi 9p2 



'fll(-) 62(0 

.€u(-) a 2 (-) 



&*(•)' 
&*(■) 

MO 



y^^^(-)- 



£=1 



3* 



(51) 



If &fc(aj) is continuous for all £,k and s n (a;) = ^^\9jlt,lk(%) uniformly converges almost 
surely to some gjk{x), then gjk(%) is almost surely continuous. That is, if for all e > there 
exists an N such that for all n > N 



(52) 



Pr sup \gjk(x) ~ s„(x)| < e = 1, 



then s n (x) converges uniformly almost surely to gjk{x) and we can conclude that gjk(x) is 
continuous based on the definition of s n (x). To show almost sure uniform convergence, it 
is sufficient to show that there exists an M n with X^^Li M n almost surely convergent and 



sup \0jn£nk(x)\ < M n 



x£X 



Let c nk = sup xgA , \^nk(x)\- Then, 



Sup \6 jn £ nk (x)\ < \9jn\Cnk- 
x£X 



(53) 



(54) 



Since £ n fc( - ) * ~ ' GP(0, c) and A' is compact, c n k < 00 and -E[c„fc] = c with c finite. Defining 
M n = \0jn\cnk, 



E M « 



n=l 



So 



|B 



.71=1 



n=l 



= c^£; e [|^„|], (55) 



where the last equality follows from Fubini's theorem. Based on Assumption 2.1, we con- 
clude that £?E^Li -^n] < 00 which implies that X)«=i converges almost surely. 

Proof (Proof of Lemma 2.2). Recall that 0# - 7V(0, ^"/t" 1 ) with <^ ~ Ga(3/2,3/2) 

and T£ = Yih=i f or Si ~ Ga(oi, 1), Sh ~ Ga(a2, 1). Using the fact that if a; ~ A/"(0, <r) then 
E[|a;|] = ctv^Ttt and if y ~ Ga(a, 6) then 1/y ~ Inv-Ga(a, 1/6) with E[l/y) = 1/(6- (a- 1)), 
we derive that 



y^Ee[|flji|] - y^g0, T [£^e|0 ;r [|6i 



£=1 



00 00 



"=1 



n 



1 4 /2 



E 



Oi — 1 3 V 7T ^-^ 02 — 1 

£=1 



(56) 



When 02 > 2, we conclude that < 00 • 
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Proof (Proof of Lemma 2.3). Recall that E(as) = Q£(x)£{x)'Q' + E with E = 
diag((Tj, . . . , <ip). The elements of the respective matrices are independently distributed as 
it ~ A/"(0, <^ Tg~ ), ^fe(-) ~ GP(0, c), and ct~ 2 ~ Gamma(a„, 6 CT ). Let \i„ and ct 2 . represent 
the mean and variance of the implied inverse gamma prior on erf , respectively. In all of the 
following, we first condition on and then use iterated expectations to find the marginal 
moments. 

The expected covariance matrix at any predictor location x is simply derived as 

E[E(x)} = E[E\E(x) | 6]] = E[E[QZ(x)Z(x)'Q' \ 0]] +£t CT I p 
= kE[QQ'} + nJp 

= diag(fc Y 4>u T i X + M<r, • • • , k Y <^i T i X + /v)- 
i i 

Here, we have used the fact that conditioned on 0, 0£(x)£(x)'0' is Wishart distributed 
with mean fc00' and 

£[00% = y E = Y E t e ^ 

if i 

= Y / var(6tf)8 ij = \ Y^it T iJ 

Proof (Proof of Lemma 2.4). One can use the conditionally Wishart distribution 
of Q£(x)£(x)'& to derive cov(T, ij (x),'E 11v (x)). Specifically, let S = 0£(a;)£(a;)'0'. Then 
s = En=i z (n) z^y with | - A/"(O,00') independently for each n. Then, using 
standard Gaussian second and fourth moment results, 

cov(£y(a;), T, uv (x) | 0) = cov(,%, S uv | 0) + al6 ijuv 

= Y E[z^zfz^z^ | 0] - E[z^zf | 0]S[4 n) 4 n) I 6] + olk y 

n=l 

= k((QQ') iu (QQ') jv + (Qe') iv (QQ') ju ) + a 2 J lJuv . 

Here, 8ij uv = lifi=j=u = v and is otherwise. Taking the expectation with respect to 
yields cov(£y (x), H uv (x)). However, instead of looking at one slice of the predictor space, 
what we are really interested in is how the correlation between elements of the covariance 
matrix changes with predictors. Thus, we work directly with the latent Gaussian processes 
to derive cov(£ij(x), H uv (x')). Let 

9in (x) = Y Qil&n {%), (57) 

I 

implying that <7m(x) is independent of all gi m (x') for any m ^ n and all x' G X. Since 
each ^i n (-) is distributed according to a zero mean Gaussian process, gi n {x) is zero mean. 
Using this definition, we condition on (which is dropped in the derivations for notational 
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simplicity) and write 

fc 

cov('Sij(x),'E uv (x') | 9) = cov(g in (x)g jn (x), g un (x') ')) + <j a S 

n=l 

k 

= ^ E[g in (x)g jn (x)g un (x') )] - E[g in (x)g jn (x))E[ ')] + a a 5 

ijuv 

n=l 

We replace each gkn{x) by the form in Equation (57), summing over different dummy indices 
for each. Using the fact that £,g n (x) is independent of &' n (x') for any I ^ £' and that each 
£,tn {x) is zero mean, all cross terms in the resulting products cancel if a ^g n (x) arising from 
one gkn{x) does not share an index £ with at least one other £,£ n (x) arising from another 
g pn (x). Thus, 

fc 

cov(£ i:) ( 

) i ) = EE^^^^w^^')] 

n=l £ 

+ ^20 u e ue E[i l , n {x)£, l , n {x')] Y 6je'Qvi'E[& n (x)&n(x')] 

ijuv 

e e 
The Gaussian process moments arc given by 

mi(x)} = i 

E[^ n (x)^ n (x')} = E[E[& n (x) | & n {x')]U n {x')] = c(x,x')El$ n (x')] = c(x,x') 

E[e ln (x)e ln {x')]=E[E[e tn (x) \ 

= E{{(E[& n (x) | Un{x')]) 2 + var(&„(x) | &„(x'))R, 2 „(z')] 

= c\x,x')E[e, n {x')] + (1 - ci(x,x>))Eie en (x>)] = 2c*(x,x') + 1, 

from which we derive that 

cov(Sy(x), Z U v(x') I 6) = fc[(2c 2 (x, x') + 1) ^ QmQjMvl + c 2 (z, a/) ^ ^ 0, y 0^ 

+ ^ ^ @u£'9vi' — Y^ Qi^jt Y^ ^ ui '^ vi ' \ + a "a&\ 

= kc 2 (x, x') I "Y^ 6u9ji0ui6vi + *Y< 0j£i9 v £i 
{ e ti 1 



J IJUV 



® a^ijuv ■ 



An iterated expectation with respect to 9 yields the following results. When i / 11 
or j ^ v, the independence between 6u (or 9ji) and the set of other 9ki implies that 
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cov(£jj(x), Yi uv (x')) = 0. When i = u and j = v, but i =/= j, 

I e £ i< ) 

= kc\x, x') j]T ^v-v- 2 + £ ^ 0-i T -i I . 

Finally, when i = j = u = v, 

cov(Z u (x), E,,(x')) = kc 2 (x, x')\2j2 Em + £ E[e%] £ E[e%,] I + a 2 

= kc 2 (x, x') J 6 ^ ^V 2 + 2 <fe V' E 1 | + 



B. Derivation of Gibbs Sampler 

In this Appendix, we derive the conditional distribution for sampling the Gaussian process 
dictionary elements. Combining Eq. (3) and Eq. (4), we have that 



2/i = e 



_£,Ll(Xi) £,L2{Xi) 

implying that 



= e 



Vij 



^lk(Xi) 
£,2k{Xi) 

L k 

/~] ^ 9jlVirntilm(Xi) + £j 
t=l m=l 



-'Em=l^Lm(Xi)v 



Lin. 



(58) 



(59) 



Conditioning on £(■) , we rewrite Eq. (58) as 



7 i/ 



Let 9. e = [9 1 
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Then, 
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(60) 
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Defining A(_ m = diag(r/i m 0.,e, . . . , r) nm 6.i), our Gaussian process prior on the dictionary 
elements £g m (") implies the following conditional posterior 



S,trn{x2) 



| {y i },e,r ? ,C(-) ! So~AT 



yi 



\ 



= TV S 



11™ ELi^ ct ,- 2 yi/ 



,1™ Si=i 2 ynj. 



(62) 



where j/j = — /J-em(xi) and, taking _ftT to be the matrix of correlations Kij = c(xi,Xj) 
defined by the Gaussian process parameter k, 



XT 1 = K- 1 + AL 



A em = K 1 + diag | nl m e %0j 

j'=i 



) * • * ; 'Inm 



(63) 
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